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1 Introduction 



The incorporation of fermionic degrees of freedom in lattice simulations is a very general 
and longstanding algorithmical problem. Since there is no efficient stochastic way for 
calculating the determinant of the fermion matrix, one usually calculates the inverse 
determinant of the inverse fermion matrix, which introduces a non-local action. This 
limits the choice of algorithms in an essential way and so the computational effort for 
simulating fermions is about a factor 100-1000 larger than for bosons. The algorithm 
most commonly used today is the hybrid monte carlo algorithm (HMC) [13], which has 
a much worse volume V and correlation length ^ behaviour than local algorithms for 
bosonic fields. 

In particular in lattice QCD this has made simulations including the effect of light 
sea-quarks extremely difficult. Therefore simulations are usually done with setting the 
fermion determinant to 1 (quenching), thus taking only into account the effects of the 
valence quarks. While this may be a fair approximation for some observables such as 
some aspects of the mass spectrum, considerable effects are expected for example for 
as. 

It seems therefore crucial to find new simulation methods or to improve on the 
existing fermion algorithms. One algorithm which is in the course of development and 
testing is the local bosonic algorithm introduced by M. Liischer [1]. This paper will 
describe some improvements of this algorithm as well as a first study of its behaviour 
on lattice sizes up to IG"*. 

2 The local bosonic fermion algorithm 

Since the algorithm was already discussed in great detail in [1,3], we will give only a 
short introduction and basic notations. 

For the S'f/(2)-LGT with two fiavours of Wilson quarks, the partition function is 
given by 



with M = 1 — KH the Wilson fermion matrix and S'vK[f^] the usual Wilson gauge action. 
All simulations were done with periodic boundary conditions. For convenience, we work 
with Q = C075M with Cq ^ = cm(1 + 8K) instead of M. For cm > 1 we have \\Q\\ < 1 
and furthermore = Q. The method is based on a polynomial approximation of the 
inverse of the fermion matrix. We employ here, as in [3], Chebychev polynomials to 
construct polynomials Pn{x) of even degree n which statisfy 




(2.1) 



xPn{x)-l\<5 Va;G[e, 1] 



(2.2) 



where S is given by 




(2.3) 
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This means that, keeping e fixed, we have to scale n cx — ln{S) and, keeping S fixed, 
n cx Different possible polynomials are discussed in [f,9,f2]. One then writes 

n 

detg2~(detP„(Q2))-i^ Y[{Q-rk){Q-rl) (2.4) 

k=i 

where is the square root of the A;-th root of P„ with Im > 0. Introducing n bosonic 
auxiliary fields, we can rewrite 

/n 
rf[f/]e-5^[C/] -Q rf[<^t]^[<^,]e-l(Q-'-'=)^'=P . (2.5) 
k=i 

We can now apply the standard local heatbath and over-relaxation procedures for both 
the bosonic fields and the gauge field. The first simulations have shown that the auto- 
correlation times are rather large [3,4] due to the fact that n fields are coupled to one 
gauge field and are updated one after the other. This results in the fact that the bosonic 
fields are guided by the gauge field and vice versa, since they are not updated simul- 
taneously. This guidance results in small step sizes and therefore the autocorrelation 
time is found to be proportional to n. On the other hand it was found that surprisingly 
large S, of the order of a few percent on lattices up to 6"^ X 12, still give good results. 
Also the tuning of e seemed not to be critical. 

3 Hybrid Over-Relaxation 

Hybrid over-relaxation is the state of the art algorithm for bosonic systems when no 
cluster algorithm is available [8]. It is believed that this algorithm, which consists simply 
in the mixing of heatbath and over-relaxation sweeps with a ratio 1 : Kq, has a dynamical 
critical exponent 2: ~ 1 if is tuned proportionally to ^. This has been verified for the 
case of SU{3) pure gauge theory. 

In [4] it was discussed that application of hybrid over-relaxation to bosonic and 
gauge fields separately fails, since for fixed bosonic fields the gauge fields have a very 
narrow distribution around the bosonic force and vice versa. However, applying 1 
bosonic heatbath sweep and Hq pairs of gauge and bosonic over-relaxation sweeps a 
substantial improvement can be observed. (The gauge field needs not to be updated 
with a heatbath sweep since ergodicity is ensured by the bosonic heatbath update.) 

In table 1 the autocorrelation time dependence is displayed for several updating 
schemes. Note that these numbers are obtained using the preconditioned algorithm, 
which will be discussed in the next section. The autocorrelation times are given in units 
of matrix- vector multiplications of the fermion matrix Q (see also appendix C). It is 
seen that the autocorrelation time in units of (^(^-multiplications can, in comparison to 
the original version of the update used in [3], be reduced from 45 X 10"^ by a factor of 3 
to ~ 15 X 10"^ in the 30-field case. When adding more over-relaxation sweeps Tint starts 
to rise again for the examined parameters. However, we expect the gain to increase 
with increasing correlation length. 
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Table 1: Autocorrelation times in units of 1000(5(?!)-multiplications on the 4"^ X 8 lattice 
for /3 = 1.75 and K = 0.165. All runs were done with ra = 30, e = 0.0045 (thus S ^ 0.03) 
and Cm = 1.1. The letters in the first column give the type and order of sweeps used 
per iteration, where H is a bosonic heatbath, O a bosonic over-relaxation and h and o 
the gauge updates. A g denotes black-white checkerboarding of the gauge field, b the 
same applied to the bosonic fields. G and B denote the extended checkerboarding. 

Generally we performed a sweep through the lattice as a series of lexicographically 
ordered local updates, i.e. by performing four loops over the four local coordinates. We 
also examined the effect of updating with checkerboard updates. We subdivided the 
lattice into 2 sublattices using the normal black-white checkerboarding and 16 sublat- 
tices with an extended checkerboard which were updated one after the other. For this, 
we coloured each site and link according to the functions 

C{x) = (-1)^1+^2+^3+^4 and (3.1) 
Ce(a.) = ix:2^"'(l-(-ir^)- (3-2) 

We then updated one colour after the other. In the case of the extended checkerboard 
sites and links of the same colour do not interact with each other, i.e. they can be 
updated simultaneously. Due to the term in the action this is not the case for the 
usual checkerboarding. In table 1 the results are presented. No significant improvement 
can be seen, which is in accordance with intuition, since for the gauge field the amount 
of change is mainly given by the bosonic force, which is held constant during the gauge 
update. 

A further interesting possibility, namely to update gauge fields and bosonic fields 
one after the other locally, was not yet tested, since the present implementation of the 
algorithm does not allow to update all n bosonic fields site by site efficiently. In this 
case the effect of checkerboarding might be much stronger. 
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4 Even-odd preconditioning 



The number of fields needed in the approximation depends on the spectral condition 
number k{Q) = X^axiQ) / ^miniQ) of the fermion matrix Moreover, it enters the auto- 
correlation time linearly [4]. It is well known that there are matrices with a determinant 
proportional to that of the fermion matrix which have lower condition number, so called 
preconditioned matrices. For the bosonic theory, however, it is necessary that the re- 
sulting action stays local. A simple and useful possibility which is known to work well is 
even-odd preconditioning [5]. One uses the fact that the off-diagonal part of the fermion 
matrix connects only even to odd sites and vice versa. If we write the fermion field in 
the following way: 

where tpg denotes the part of the field which lives on the odd sites, we can write the 
fermion matrix M as 

The following identity proves to be very useful: 

= det Adet (li - CA-^S) . (4.3) 

Applying it to the fermion matrix, we get 

det M = det (l - K^DoeDeo) ■ (4.4) 




Let us denote the matrix on the right hand side with M. This matrix connects only 
odd with odd sites, so the associated fermion fields live only on half of the lattice sites. 
The eigenvalues of M are connected to the ones of M via 

A = 2A-A2, (4.5) 

and ||M|| is bounded by 1 GAK"^ in contrast to 1 8K for ||M||. Note that while 
1 1 Mo 1 1 = 1 -|- 8K , where Mq is the free fermion matrix, ||Mo|| is generally smaller than 
1 -|- 64/i'^. For the bosonic algorithm we use 

Q = Coj5{l- K^DoeDeo) (4.6) 

with ^0 ^ = cm(1 + 64/i'2), which fulfills = Q and ||Q|| < 1 for cm > 1- 

Unfortunately, connects (next to)"^ nearest neighbors so that a local update step 
becomes very complex. Fortunately, this problem can be circumvented by applying (4.3) 
once again after we applied the polynomial to [7]. We start from 

P{Q') = U(Q' - = U(Q - ^k){Q - r-t). (4.7) 

k k 
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Now we can apply (4.3) to each factor and get 



det (q - r,) oc det . -c.,.KD.^o \ 

^ ' \-CQ-i^KUoe Co75 - rfc j 

Letting Q = cq/cqQ^ we obtain the preconditioned action 

S, = Y.<t>l{Q-Porl){Q-Pork)4>k, (4.9) 

k 

which is very similar to the original one. Pg denotes the projector on the odd sites. 
Explicit formulae for the updates are given in appendix A. One might wonder whether 
the reverse use of (4.3) does not destroy the advantages of the preconditioning, but this 
last transformation only affects the behaviour of the bosonic modes, which did not seem 
to be critical [4]. This is verified by the test results. 

Note that the same trick also applies to other cases of preconditioning, for example 
if we use the inverse free fermion matrix, i.e. M = M^^M. M cannot easily be made 
hermitian, so that we have to use a non-hermitian approximation as described in [9]. 
We then have 

P{M) = \{{M - Zk){M - zl) (4.10) 
k 

and use 



detP(M) o^Wdei {{M - ZkMo){M - zlMo)) 
k 

= Hdet ((M - ZkMa){M^ - zImI) 



(4.11) 



to get again a local action. In the last line we used the fact that 75M75 = M'^ . Since 
this kind of preconditioning is only expected to work well in the case of fixed gauge 
simulations, we did not test it. Note that for calculating the eigenvalues and for a 
global metropolis step, one still has to use and implement M. 



4.1 Numerical results 

To be able to exploit the effect of the decreased condition number k in the polynomial in- 
version, one has to study the lowest and highest eigenvalues of in detail. In figs. 1 and 
2 we display the distributions of \m\n{Q^) and \ma.x{Q^) as well as \m\n{Q^) / ^mmiQ^) 
and k{Q^) /k{Q'^) for various lattice sizes {cm was set to 1). The 4"^ X 8 lattice was run at 
(3 = 1.75 and K = 0.165, the other lattices at /3 = 2.12 and K = 0.15. The distribution 
of Ainin((5^)/Ainin((5^) as Well as k{Q^) /k{Q^) is very narrow, showing that this kind of 
preconditioning works very well for the cases we investigated. With growing lattice size, 
Amin(Q^)/Amin(Q^) approaches the expected value of 3.25 for K = 0.15. Surprisingly, 
the condition number decreases by a factor of ~ 8 instead of the expected factor of ~ 4. 
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Table 2: Autocorrelation times in units of 1000(5(?!)-multiplications on the 4"^ X 8 lattice 
for (3 = 1.75 and K = 0.165. n and e are chosen so that S = 0.03. All except the first 
run are preconditioned. 

The reason is that Xma.x{Q^) is far below the bound, growing slightly for larger lattice 
sizes. This was also seen for SU{3) gauge configurations in [9]. This might indicate that 
the bound is actually too high. The widths of the distributions are roughly proportional 
to l/-\/y, as expected for a global quantity, and already very narrow for small lattice 
sizes, which simplifies the tuning of e and cm- 

In [3] it was seen that observables were quite insensitive to e. We also expect this 
to be the case for the preconditioned case, since the argument was that P{x) will still 
approximate 1/x even for x < e. In fig. 6 we investigated the dependence of several 
observables on e, keeping S = 3%. We find, in agreement to our earlier results, that 
e has little effect on the observables, as expected. The autocorrelation times seem to 
be consistent with a Tint cx nj ^ behaviour, as mentioned in [6]. One should note that 
the difference in the computational effort between the highest and lowest e in fig. 6 is 
roughly a factor of 10. 

In fig. 5 the dependence on cm was investigated, keeping e and b fixed. The 
observables are very sensitive to \ma.^[Q^) when it gets larger than one. The plaquette 
is, as expected, most sensitive, but also the masses show significant deviations. The 
autocorrelations grow rapidly when Ainax(Q^) gets close to one. The reason is that 
there are always roots close to 1, so that \Q — r^p develops small eigenvalues. Thus 
one expects that the corresponding bosonic field with the action \[Q — -Po''fc)<?^fcP also 
develops slow modes. It is therefore necessary to monitor \ma.^[Q^) and make sure that 
it is reasonably below 1. 

We now come to the autocorrelation times of the preconditioned algorithm. In table 
2, various runs for different cm with fixed b are shown. The number of independent 
configurations used was around 200 for each run. The first line contains the original, 
non-preconditioned run. Using the same polynomial in the preconditioned case shows 
no significant effect on Tint. We conclude that the bosonic modes are well behaved and 
the back-transformation (4.8) does not spoil the dynamics of the fields. Adjusting e to 
the new minimal eigenvalue, keeping e/Amin constant decreases the computational effort 
by a factor of ~ 4.5. By setting cm = 0.6 and adjusting the polynomial accordingly, the 
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computational effort does not decrease any more, although n was decreased by a factor 
of 2. It looks like X^axiQ^) is already too large at this cm- A more detailed analysis 
is shown in fig. 4, where, keeping 6 and e/X^miQ^) constant, different cm values were 
simulated. There seems to be an optimal value for cm, coming from a trade-off between 
n and the slow mode coming from Xma.x{Q^)- The dependence is rather weak, however, 
so that it is preferable to chose cm so that (X^axiQ^)) ~ 0.7, probably depending on 
the polynomial used. 

5 Combined Update 

A further possible improvement, which was proposed in [4], is to update the gauge field 
and the bosonic fields simultaneously, giving a larger freedom to the gauge variable. 
When implemented in a local way, this however does not solve the Tint cx n problem, since 
the gauge field is still coupled to 0{n) degrees of freedom which are kept fixed during 
the update. Still one can hope to decrease the proportionality constant considerably. 

Our specific implementation of this idea is to update a gauge link while the bosonic 
fields at the end of the link are integrated out analytically. To be able to proceed to the 
next link we have then to revive these degrees of freedom by an exact update. The idea 
behind this is that the effective action for the gauge link after integrating the bosonic 
fields is still of the same form as the original action and the already existing updating 
methods for the gauge field apply. 

To keep the formulas short, we do not write the index k for the bosonic fields. 
Furthermore, we write 

e-^(^'-) = |d<^e-^(^'-). (5.1) 

We propose the following transition probabilities for a three-step-update for the 
fields U{x, 1^), (l){x) and (^(a; -|- fi): 

(5.2) 

P,{U, 4> ^ U\ 4>') = (5-3) 
P3(f/, 4> ^ U\ 4>') = A^-^ ^(^+'^)'), (5.4) 

where N denotes the appropriate normalization factor (which can depend on f/(//, x) 
etc.). Alternatively, one can use an over-relaxation step for the gauge field, i.e. we have 
an operator M which obeys = 1 and 

^-s(M(u(x,t>:)) M^)M^+m ^ ^-s(u(x,t>:) M^)M^+[^) ) 
and \DM / DU\ = 1 (for simplicity), then 

PiiUA^ U'A') = S^^)^^yS^^+f,)^^+f,y5{U{x,fi) - M{U{x,n)')) (5.6) 
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(5.5) 



also works. It is easy to show that both updates fulfill detailed balance. The combination 
of the three steps for the heatbath case is just 

Pi23{U, (f) U', = 7V-ie5(c^(^-A*)'-^W-^(^+A)')^ (5.7) 

namely a usual heatbath for all bosonic and gauge fields. For the over-relaxed algorithm, 
the effective updating probability is 

Pi23{U,^^U',^') = N-H{U{x,fi) - M(f/(a;,//)'))e-^(^(^'^)''^W'^(^+'^)'). 

(5.8) 

One can easily generalize the update to incorporate more fields, e.g. we can keep for 
example the site 4'{x) integrated out while updating all four directions of U{x^ii) and 
(j){x + jl). One should note that this update contains next-to-next-to-nearest neighbour 
interactions, which makes parallelization difficult. 

Using some technical tricks described in appendix B, it is possible to make one sweep 
through the lattice only roughly 1.3 times slower than one standard iteration consisting 
of one bosonic heat-bath and two bosonic over-relaxation and two gauge over-relaxation 
sweeps. Not all tricks were implemented yet, therefore the autocorrelation times in units 
of Q (^-operations may change by some percent. 

We did some test runs on 4"^ X 8 lattices with K = 0.165 and [i = 1.75 using the 
preconditioned action, the results of which we give in the following table: 
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20(5) 
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0.4720(14) 
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The first line was obtained using one combined over-relaxed updating, in the second 
line, a bosonic over-relaxation step was added. On smaller lattices at smaller K values a 
similar situation was found for the combined heatbath update. If we are very optimistic 
we could say that the efficiency of this update is comparable to that of the hybrid 
over-relaxation case. However, this update might have a different z since the bosonic 
fields are always updated using a heatbath method. It would be preferrable to have a 
combined over-relaxation for both gauge and bosonic fields which travels the maximal 
allowed distance in phase space. 

6 Application to other actions 

All methods are easily applicable to the non-hermitian approximation proposed in [9] 
without substantial changes. Also the application to staggered fermions is straightfor- 
ward. The improvements are also compatible with the global metropolis step proposed 
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in [9,10]. One should note that in the metropolis step, when calculating the action 
difference, one has to use P{Q) and not the transformed nfc(Q ~ Pofk){Q — Poft)- 

More complicated is the case of the Sheikoleslami-Wohlert improved fermion action 
[16] which recently seems to attain more attention in unquenched simulations. There 
the fermion matrix has the form 

where Dee and Dgo are the clover terms 1 + Ki/2cswa^ijT^u which are diagonal in the 
position space indices. The preconditioned matrix takes the form 

M = DeeiDoo - K^DoeD;2Deo). (6.2) 

Here, 75M is no longer hermitian, so we have to use a non-hermitian approximation. 
Note that Dee and Dqo and its inverses are hermitian and commute with 75. Applying 
the same tricks as above, one can obtain the action 

Sb = Y.\{M-I)eerk)^k^ (6.3) 

with 

The bosonic updates are then easily calculated. The gauge update becomes rather 
cumbersome, since Dee contains gauge links and the there is no analytic expression 
for Dee ■ One has then to implement a numerical gauge over-relaxation step like the 
one described in [8]. A different possibility would be to handle Dee with an additional 
auxiliary field and use M' = Dqo — K^DoeDee Deo as the preconditioned matrix. This 
matrix has the property that 75M' is hermitian and allows us to use the hermitian 
approximation. We can then write 

det{Doo-rk-K^DoeD;e' Deo) = det{ ..-^n-i ^^^^'l (6-5) 



and, denoting the right hand side matrix with M' — Pork, use the action 

Si = ' + E |(^' - Pork)^k ' . (6.6) 



k 



7 Conclusions 

We have presented improvements of the local bosonic fermion algorithm proposed by 
M. Liischer. A considerable speedup could be obtained by hybrid over-relaxation and 
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preconditioning. Both algorithms were found to work well in a wide region of volumes. 
They are easily combined with other methods proposed in [6,9]. Also we were able to 
reduce the memory requirements by a factor of ~ 3. We found that the tuning of the 
parameters is not trivial and that the eigenvalues Xmin and Xmax need to be monitored. 
We found with S = 3% excellent agreement of the method with HMC and Kramers 
values [14] up to quite high statistics. The dependence on e was, as expected, found to 
be small. 

As for the case of the combined update, we found a comparable performance to 
the standard updates. There are still modifications which might work better and have 
to be investigated in the future. The update is compatible to the other improvements 
proposed so far. However, its implementation is tedious and is probably no longer 
applicable to Sheikoleslami-Wohlert improved fermion actions. 

A performance comparison between optimized versions of the local bosonic method 
and the Kramers algorithm will be made in [15]. 

This algorithm still bears many possibilities of improvements, especially since its 
structure is relatively simple in comparison with HMC like methods. Its dynamics seems 
to be more easy to understand and allows for a more direct examination of its problems. 
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A Explicit formulae for the updates 

Due to the similarity of the preconditioned action Sf, to the original one we can apply 
essentially the same techniques as for the original program, which were presented in [3] 
and, in great detail, in [2]. Here we will give the complete formulas for the preconditioned 
and combined case. 

For convenience, we define Xe{x) and Xo{x) as 1 for x on an even resp. odd site, 
elsewhere. For the formulation of the bosonic updates in the preconditioned case, we 
write 



Sb[U,(l)] = [(l)k{x)]^Ak{x)(l)k{x) + [Bk{x)]^(l)k{x) + [(l)k{x)]^Bk{x) + constant. 



(A.l) 



The coefficients and are easily evaluated: 



{{Xe{x)rk + Xo{x)rl){[Q4)k\{x) - co75<?!>fc(a;)) . 




(A.3) 
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The heatbath updating can now be written as 

4>k{x) ^ Al^'\x)x - Al\x)Bk{x), (A.4) 
X being a gaussian random SU{2) spinor. The over-relaxation update is defined by 

Mx) ^ -c^kix) - 2A^'{x)Bk{x). (A.5) 
For the updates of the gauge field, we define two spinors ^ and r] as 

^ = [Q(t>k]{x) + Kdoj5U{x, - j^)My). (A.6) 
r] = [Q(^k]{y) + Kdoj5U{x, 12)^1 + (A.7) 

They are independent of U{x^ii) and we can write 

iv{U{x,ii)Fk] =2KcoRe\{xo{x)rl + Xe{x)rk)[4>k{x)]^U{x,ii)-f^{l - ^^)(t>k{y)- 

^tf/(a;,//)75(l - -f^)4>k{y) - [(t>x{x)]'^U{x,ii)-f5{l - 7^)77} . (A. 8) 
More compactly, the action is given by 

tr {U{x, l^)Fk} =v^U{x, ii)w + w'^U{x, ii)v 
=tr |f/(a;, ii)v\wk^ + h.c. 

with 



(A.9) 



w={l- 7^) 



Xo[x)rl + Xe[x)rk ^ , ^ 



Kcq 
2 



(1 - lt?)l?,(t>k{x), 



(1 + itih?,(t>k{y)- 



(A.IO) 



(A.ll) 



To use the standard updates we have to ensure that is an SU{2) matrix. This 

can be done setting 



(A.12) 



(A.13) 



(Ffc)ll = {Hk)ll + [{Hk)22]*, 
{Fk)l2 = {Hk)l2 - [{Hk)2l]* 

with {a and (3 are the color indices) 

{Hk)af3 = [{vk)f3]H'^k)a- 

This works the same way when updating SU{2) subgroups of SU{3) matrices. 
For the combined update, we can write the action in the following way: 

Sb{U{x,fi),(l){x),(l){x + fi)) = ^(l){xy Ax(l){x) + ^(l){x + fi)^Ax+(,(l){x + fi) + 

bl.(l){x) + clU {x, i2)^(l){x) + (l){x + fi)^dl. U {x, i2)^(l){x) + 

/t ~ /t ~ (^-14) 

b ^(f){x + /i) + c xU{x, fi)(f)(x + /i) + h.c. + const. 
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where A, b, b', c, c' and d do not depend on U{x,i2), (l){x) and (^(a; + fi). Then, the 
effective actions take the following form: 

Sb{U{x,fi),(l){x), (l){x + fi)) =^(l){xyA^^f,(l){x) + 



bUix) + {x, i2)U{x) - 4a f -f/ {x, i2)b', + 



h.c. + const. 



Sb{U{x, i2),(l){x), <j){x + jl)) = - clA^\U{x,ii)^bx - c'lA^}JJ{x,ii)b'x + h.c. + const. 



iiv[u{x,n)F^^,] 



+ const. 



(A.15) 
onst. 
(A.16) 



with 



4 


— ^x 


- d d^ 

"'X,fi^X + fl"'X,fM 




= bx- 


"'X,fi^X + fl^ -^i 




— ^x 


dx,iiA^^^b'x. 



F^^x can be calculated exactly as above. The updates then look as follows 

U{x,i2) -^U{x,i2)' (over — relaxed or heatbath), 
Hx) -^A-}J\ - A-^ (bx + U {x, i^Ydx) , 
c^ix + A) -^A-lfx - A;1^ [b'x + U{x, i^y^c'x + U{x, i^yUl^c^ix)' 



(A.17) 

(A.18) 
(A.19) 

(A.20) 



A-x,ii is no longer diagonal in the dirac indices, but its inversion and taking a square 
root can easily be done analytically using the properties of the dirac algebra. Explicit 
formulae for b, c and d are given for the preconditioned case by: 



d 1^ {^x 

CiiiX 



bij.{x 



-2ilK + {xe{x)rk + Xo{x)rlYcoK^^{l - 7,.), (A.21) 
-co/i'(l + 7^)(75[<9<?!>](a; + /i) - cq(]){x + (i) + 2cqKU\x, ii)(]){x)), 

(A.22) 

-coK{l - 7^) (75[<9<?!>](a;) - co(t>{x) + 2coKU^{x)(t>{x + fi)) , 

(A.23) 

B{x) -U{x,ii){c^{x) + d^{x)(]){x + (i)), (A.24) 
B{x + U^{x,ii){c' (x) + dl{x)<j){x)). (A.25) 



B Implementation 

All modifications were implemented on Quadrics Ql and QH2 machines, which have 
a SIMD parallel architecture. For the updates, we apply the same tricks as already 
described in [2,3], namely we use the projector properties of 75(1 ±7^) to minimize the 
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number of SU{2) times vector-multiplications. Furthermore, for the bosonic update of 
a field (^fc at a site x, the term Q'^4'k{x) enters the calculation. We calculate and store 
if) = Q(j)k before the sweep through the lattice and correct this vector after the update 
of each site x and use Q'^(f)k{x) = Qip{x). This also saves a great amount of calculations. 
This trick can also be applied to the gauge and combined updates. Since here all n fields 
enter for the update of one link, one has to store Q4>k for all k simultaneously. To avoid 
too large memory costs, one can store only slices of Q(j). One can for example store 
Q(j) for the timeslices t — l,t and t -\- 1 and then update the timeslice t of the lattice. 
Proceeding to i + 1, one has only to calculate Qcj) at the slice i + 2. 

Another problem for the SIMD architecture is to update the lattice if points at 
local sublattices interact directly, i.e. when a point x on the local sublattice on one 
node interacts directly with the point x on the neighbouring node. This can happen 
when one divides the lattice into too small slices, i.e. when one wants to put an 8"* 
lattice on an 8^ X 4 node cubic machine topology. This is even worse for the combined 
update, where also next-to-nearest neighbours interact, so that a sublattice has to have 
at least 3 sites in each direction. Since all processors have to work synchronously, 
interacting sites will be updates simultaneously, leading to wrong results. When one 
does not split one direction into slices, i.e. it is stored completely on each node, one can 
circumvent this problem. First we divide the processor array in even and odd nodes. 
Let us assume that we store the fields locally in an array phi[x,y,z,t] and that T, 
the temporal extent of the lattice, is even. On odd nodes, the same local coordinate 
can be stored in phi [x,y,z, (t+T/2) mod T] . Now, updating the same local index, one 
updates noninteracting physical sites. When coming to the boundary, one has, on even 
nodes, to add T/2 to the time coordinate and on odd nodes subtract T/2. Modulo T, 
this is the same operation, therefore it is possible to do it synchronously, as required 
for SIMD machines. 



C Data analysis 

Generally, autocorrelation times are expressed in units of Qcj) operations to have compa- 
rable numbers for all updating schemes and lattice sizes. Multiplying the given numbers 
by the time needed by one Qcj) operation one gets the autocorrelation time in units of 
seconds. However, we expect the Qcj) units to be more portable to other machines since 
communication overhead effects and other effects like cache dependencies should be 
roughly equally present in a Qcj) operation and the rest of the updates for most machine 
architectures. 

For the data analysis several methods were applied. Masses were measured using 
the definition 

[ C^jT/2-l) + C^jT/2 + l) \ 
niTT „ = acosh — — — ^ — ; — , C.l 

V 2a,,(r/2) y ' ^ ' 

where C^r and Cp are the appropriate averaged meson correlation functions. No smearing 
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was applied. Autocorrelation times for an observable X were measured using the method 
proposed by Sokal [11], namely 



2, ^(0) 

t= — n-\-m ^ ' 



with 



n-\t\ 

and m chosen so that Tint <^ m <^ n. We chose self-consistently the smallest value of m 
for which m/rint > 4. An estimate for the error of Tint is given by 

2 2(2m+l) _, 

^^mt = ^ ^inf (C.4) 

For the remaining observables, the data was binned and analyzed using jackknife statis- 
tics. Then a plateau was searched in the estimate of the variance and this was taken as 
the estimate for the error. It was checked that this procedure gives compatible values to 
the formula = 2rint(T„j^;yg. The bias found in the jackknife procedure for the masses 
was always at least two orders of magnitude smaller than the error. The thermalization 
was taken to be at least lOrint iterations for systems with random start and Srint for 
pre-thermalized systems. 
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Figure 1: Distribution of the smallest (left) and largest (right) eigenvalues of for a 
16'* (solid), 8^ X 12 (dashed) and 4^ X 8 (dotted) lattice. 
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Figure 2: Distribution of Ainin((5^)/Ainin((5^) (left) and of the quotient of the condition 
numbers k{Q'^)/k{Q'^) (right) for a 16"* (solid), 8^ x 12 (dashed) and 4^ X 8 (dotted) 
lattice. 
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Figure 4: Dependence of the computational effort on cm when keeping S and 
e/('^min(Q^)) constant. The upper figure shows the computational effort in arbitrary 
units for the plaquette (boxes) and C\{T/2) (triangles), the lower figure (X^axiQ^))- 
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Figure 5: Dependence of several observables on cm- From top to bottom there are 
rint(n), the plaquette (the solid line gives the HMC value), the masses (here the cm = 0.6 
regions are shown as dotted lines), Xmin{Q^) and Xma.x{Q^)- Note that in the bottom 
figure the error bars give the width of the distribution of X^axiQ^) rather than the 
errors. 
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Figure 6: In the upper picture Tint for the plaquette (squares) and C\{0) (triangles) is 
shown. In the middle picture the tt and p masses for the 6"^ X 12 lattice and several e 
values are drawn. At e = the HMC points are drawn with a triangle. The dashed 
lines indicate the HMC error bars. The lower picture shows the plaquette. 
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